{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-basics/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
    "\n",
    "# How do I customize TTS pronunciations?\n",
    "\n",
    "This tutorial walks you through the basics of NeMo TTS pronunciation customization. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
    "4. Run this cell to set up dependencies.\n",
    "\"\"\"\n",
    "\n",
    "BRANCH = 'main'\n",
    "# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
    "# !apt-get install sox libsndfile1 ffmpeg\n",
    "# !pip install wget text-unidecode \n",
    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grapheme-to-phoneme (G2P) Overview\n",
    "\n",
    "Modern **text-to-speech** (TTS) models can learn pronunciations from raw text input and its corresponding audio data.\n",
    "Sometimes, however, it is desirable to customize pronunciations, for example, for domain-specific terms. As a result, many TTS systems use grapheme and phonetic input during training to directly access and correct pronunciations at inference time.\n",
    "\n",
    "\n",
    "[The International Phonetic Alphabet (IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and [ARPABET](https://en.wikipedia.org/wiki/ARPABET) are the most common phonetic alphabets. \n",
    "\n",
    "There are two ways to customize pronunciations:\n",
    "\n",
    "1. pass phonemes as an input to the TTS model, note that the request-time overrides are best suited for one-off adjustments\n",
    "2. configure TTS model with the desired domain-specific terms using custom phonetic dictionary\n",
    "\n",
    "Both methods require users to convert graphemes into phonemes (G2P). \n",
    "\n",
    "#### All words for G2P purposes could be divided into the following groups:\n",
    "* *known* words - words that are present in the model's phonetic dictionary\n",
    "* *out-of-vocabulary (OOV)* words - words that are missing from the model's phonetic dictionary. \n",
    "* *[heteronyms](https://en.wikipedia.org/wiki/Heteronym_&#40;linguistics&#41;)* - words with the same spelling but different pronunciations and/or meanings, e.g., *bass* (the fish) and *bass* (the musical instrument).\n",
    "\n",
    "#### Important NeMo flags:\n",
    "* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/stable/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt)\n",
    "* `your_spec_generator_model.vocab.g2p.heteronyms` - list of the model's heteronyms, grapheme form of these words will be used even if the word is present in the phoneme dictionary.\n",
    "* `your_spec_generator_model.vocab.g2p.ignore_ambiguous_words`: if is set to **True**, words with more than one phonetic representation in the pronunciation dictionary are ignored. This flag is relevant to the words with multiple valid phonetic transcriptions in the dictionary that are not in `your_spec_generator_model.vocab.g2p.heteronyms` list.\n",
    "* `your_spec_generator_model.vocab.phoneme_probability` - phoneme probability flag in the Tokenizer and the same from in the G2P module: `your_spec_generator_model.vocab.g2p.phoneme_probability` ([0, 1]). If a word is present in the phoneme dictionary, we still want our TTS model to see graphemes and phonemes during training to handle OOV words during inference. The `phoneme_probability` determines the probability of an unambiguous dictionary word appearing in phonetic form during model training, `(1 - phoneme_probability)` is the probability of the graphemes. This flag is set to `1` in the parse() method during inference.\n",
    "\n",
    "To ensure the desired pronunciation, we need to add a new entry to the model's phonetic dictionary. If the target word is already in the dictionary, we need to remove the default pronunciation so that only the target pronunciation is present. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Default G2P\n",
    "\n",
    "Below we show how to analyze default G2P output of the NeMo models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import nemo.collections.tts as nemo_tts\n",
    "import IPython.display as ipd\n",
    "\n",
    "# Load mel spectrogram generator\n",
    "spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(\"tts_en_fastpitch_ipa\").eval()\n",
    "# Load vocoder\n",
    "vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name=\"tts_en_hifigan\").eval()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# helper functions\n",
    "\n",
    "def generate_audio(input_text):\n",
    "    # parse() sets phoneme probability to 1, i.e. dictionary phoneme transcriptions are used for known words\n",
    "    parsed = spec_generator.parse(input_text)\n",
    "    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)\n",
    "    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)\n",
    "    display(ipd.Audio(audio.detach().to('cpu').numpy(), rate=22050))\n",
    "    \n",
    "def display_postprocessed_text(text):\n",
    "    # to use dictionary entries for known words, not needed for generate_audio() as parse() handles this\n",
    "    spec_generator.vocab.phoneme_probability = 1\n",
    "    spec_generator.vocab.g2p.phoneme_probability = 1\n",
    "    print(f\"Input before tokenization: {' '.join(spec_generator.vocab.g2p(text))}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"paracetamol can help reduce fever.\"\n",
    "generate_audio(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Expected results if you run the tutorial:\n",
    "<audio controls src=\"https://github.com/NVIDIA/NeMo/raw/main/tutorials/tts/audio_samples/default_ipa.wav\" type=\"audio/ogg\"></audio>  \n",
    "\n",
    "\n",
    "During preprocessing, unambiguous dictionary words are converted to phonemes, while OOV and words with multiple entries are kept as graphemes. For example, **paracetamol** is missing from the phoneme dictionary, and **can** has 2 forms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_postprocessed_text(text)\n",
    "for word in [\"paracetamol\", \"can\"]:\n",
    "    word = word.upper()\n",
    "    phoneme_forms = spec_generator.vocab.g2p.phoneme_dict[word]\n",
    "    print(f\"Number of phoneme forms for wordPhoneme forms for '{word}': {len(phoneme_forms)} -- {phoneme_forms}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Input customization\n",
    "\n",
    "One can pass phonemes directly as input to the model to customize pronunciation.\n",
    "\n",
    "Let's replace the word **paracetamol** with the desired phonemic transcription in our example by adding vertical bars around each phone, e.g., `ˌpæɹəˈsitəmɔl` -> `|ˌ||p||æ||ɹ||ə||ˈ||s||i||t||ə||m||ɔ||l|`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Original text: {text}\")\n",
    "\n",
    "new_pronunciation = \"ˌpæɹəˈsitəmɔl\"\n",
    "phoneme_form = \"\".join([f\"|{s}|\" for s in new_pronunciation])\n",
    "text_with_phonemes = text.replace(\"paracetamol\", phoneme_form)\n",
    "print(f\"Text with phonemes: {text_with_phonemes}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_audio(text_with_phonemes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Expected results if you run the tutorial:\n",
    "<audio controls src=\"https://github.com/NVIDIA/NeMo/raw/main/tutorials/tts/audio_samples/phonemes_as_input.wav\" type=\"audio/ogg\"></audio>  \n",
    "\n",
    "\n",
    "## Dictionary customization\n",
    "\n",
    "Below we show how to customize phonetic dictionary for NeMo TTS models. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add a new entry to the dictionary for the word **paracetamol**. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we download IPA-based CMU Dictionary and add a custom entry for the target word\n",
    "ipa_cmu_dict = \"ipa_cmudict-0.7b_nv23.01.txt\"\n",
    "if os.path.exists(ipa_cmu_dict):\n",
    "    ! rm $ipa_cmu_dict\n",
    "\n",
    "! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/$ipa_cmu_dict\n",
    "\n",
    "with open(ipa_cmu_dict, \"a\") as f:\n",
    "    f.write(f\"PARACETAMOL  {new_pronunciation}\\n\")\n",
    "        \n",
    "! tail $ipa_cmu_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let's now use our updated dictionary as the model's phonetic dictionary\n",
    "spec_generator.vocab.g2p.replace_dict(ipa_cmu_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Paracetamol** is no longer an OOV, and the model uses the phonetic form we provided:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_postprocessed_text(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's use the new phoneme dictionary for synthesis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_audio(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Expected results if you run the tutorial:\n",
    "<audio controls src=\"https://github.com/NVIDIA/NeMo/raw/main/tutorials/tts/audio_samples/new_dict_entry.wav\" type=\"audio/ogg\"></audio>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Resources\n",
    "* [TTS pipeline customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html#tts-pipeline-configuration)\n",
    "* [Overview of TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb)\n",
    "* [G2P models in NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/g2p.html)\n",
    "* [Riva TTS documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "38",
   "language": "python",
   "name": "38"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
